06 - Object detection - how to train your own model

Robotics I

Poznan University of Technology, Institute of Robotics and Machine Intelligence

Laboratory 6: Object detection - how to train your own model

Goals

The objectives of this laboratory are to:

Resources

How to define an object detection task?


Source: Top 10 Object Detection Models in 2023!

Object detection is a computer vision task that involves localizing one or more objects within an image and classifying each of them. The goal is to find the bounding box (rectangle) coordinates of each object in the image along with its class label.

Usually, object detection bounding boxes are defined:

- using the top-left corner (x1, y1) and bottom-right corner (x2, y2) coordinates: (x1, y1, x2, y2)
- using the top-left corner (x, y) and the width and height: (x, y, w, h)
- using the center (cx, cy) and the width and height: (cx, cy, w, h)

Additionally, object detection results contain:

- confidence score: a value that represents the probability that the detected object exists in the bounding box (“objectness score”)
- class label: a label that represents the class of the detected object, usually represented as an integer value or a list of class probabilities

Note: Be careful with the bounding box format used in the dataset you are working with. If you are not sure, check the dataset documentation or visualize the bounding boxes to understand the format.
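To make the conversions between formats concrete, here is a minimal sketch in plain Python; the function names are illustrative, not part of any dataset API:

def xyxy_to_xywh(box):
    # (x1, y1, x2, y2) -> (x, y, w, h): top-left corner plus size
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)

def xyxy_to_cxcywh(box):
    # (x1, y1, x2, y2) -> (cx, cy, w, h): center plus size
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

box = (10, 20, 50, 80)       # (x1, y1, x2, y2)
print(xyxy_to_xywh(box))     # (10, 20, 40, 60)
print(xyxy_to_cxcywh(box))   # (30.0, 50.0, 40, 60)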

Object detection architectures


Source: Semantic Image Cropping

Object detection architectures can be divided into two main categories:

- two-stage detectors, which first generate region proposals and then classify and refine them (e.g., Faster R-CNN)
- one-stage detectors, which predict bounding boxes and class labels directly in a single pass over the image (e.g., YOLO, SSD)

Usually, two-stage detectors are more accurate but slower than one-stage detectors. The choice of architecture depends on the application requirements, such as speed and accuracy; in robotics, it additionally depends on the robot’s computational resources and the task at hand. Still, one-stage detectors are usually preferred in robotics due to their speed and ability to run in real time.

YOLO (You Only Look Once)

One of the most popular one-stage object detection architectures, especially in robotics and real-time applications, is YOLO (You Only Look Once). YOLO is a series of fast and accurate object detection models. The first version of YOLO was introduced in 2016, and since then, several versions have been released, evolving the architecture and improving performance.


Source: The AiEdge+: Let’s Make Computer Vision Great Again!

The model is a simple convolutional network whose last convolutional layer outputs a tensor with the dimensionality of the prediction target. For each grid cell and each prior (anchor box), the model predicts whether an object is present (i.e., whether the center of a bounding box falls in that cell), the probability of each class, and the position and dimensions of the resulting bounding box. For example, with a 13×13 grid, 3 priors, and 6 classes, the output has shape 13×13×3×(4+1+6): four box coordinates, one objectness score, and six class probabilities per prior.

Because the model will likely predict multiple bounding boxes for the same object, it is necessary to select the best ones. The idea is to choose the box with the highest confidence score, measure its intersection over union (IoU) with all other overlapping boxes of the same class, and remove those whose IoU exceeds a certain threshold; the procedure is then repeated with the remaining boxes. This is called non-maximum suppression (NMS). It ensures that a group of highly overlapping boxes is reduced to a single detection.

Source: The AiEdge+: Let’s Make Computer Vision Great Again!
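The procedure can be sketched in a few lines of plain Python (boxes in (x1, y1, x2, y2) format). This is a simplified, per-class version of what detection libraries implement in vectorized form, e.g. torchvision.ops.nms:

def iou(a, b):
    # Intersection over union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # Keep the highest-scoring box, drop boxes that overlap it too much, repeat.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep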

Note: Everything we do today should be done inside the container!

💥 💥 💥 Task 💥 💥 💥

In this task, you will train your own object detection model using the YOLO architecture.

Requirements

A graphics processing unit (GPU) is required to train the object detection model. If you don’t have an NVIDIA GPU, you can use the CPU version of the container, but the training process will be very slow.

Preparation

  1. This container is prepared for two laboratories. For today’s laboratory, go to the model_training directory, where all necessary scripts are located.

  2. Download the dataset. The script below downloads the validation split of the COCO dataset (about 5000 images). The full COCO training set contains 118k images; we therefore use this smaller subset to speed up the training process during the laboratory.

bash scripts/01_download_dataset.bash

The script generates the datasets directory with the following structure:

datasets/
├── coco_val2017
│   ├── images
│   │   ├── 000000000139.jpg
│   │   ├── ...
│   │   └── 000000581781.jpg
│   └── labels
│       ├── 000000000139.txt
│       ├── ...
│       └── 000000581781.txt

Check out the sample labels. The labels are in the YOLO format, where the first column is the class index, and the following four columns are the bounding box coordinates in the format:

class x_center y_center width height

Note: Box coordinates are normalized to the image width and height, so they are in the range [0, 1].
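As a quick sanity check, you can parse one of the label files and convert its normalized rows to pixel coordinates. A minimal sketch in plain Python follows; the file name comes from the tree above, and the image size 640×426 is only an example and should be read from the actual image:

from pathlib import Path

def load_yolo_labels(path, img_w, img_h):
    # Each line: "class cx cy w h", coordinates normalized to [0, 1].
    boxes = []
    for line in Path(path).read_text().splitlines():
        cls, cx, cy, w, h = line.split()
        cx, cy = float(cx) * img_w, float(cy) * img_h
        w, h = float(w) * img_w, float(h) * img_h
        # Convert center format to (x1, y1, x2, y2) pixel corners.
        boxes.append((int(cls), cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

print(load_yolo_labels("datasets/coco_val2017/labels/000000000139.txt", 640, 426))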

  3. The original COCO dataset contains 80 classes, too many for our purposes. Therefore, we will filter the dataset to keep labels only for the following classes: ‘person’, ‘bicycle’, ‘car’, ‘motorcycle’, ‘bus’, and ‘truck’.

Check the scripts/02_filter_labels.py script to see the instructions. When you have filled in all the gaps in the script, run the command:

python3 scripts/02_filter_labels.py

Validate an example label to see if the script works correctly.

  4. As we use the validation subset of the COCO dataset, we need to split it into training and validation sets manually. Check the scripts/03_split_dataset.py script to see the instructions. When you have filled in all the gaps in the script, run the command:
python3 scripts/03_split_dataset.py

It generates train_list.txt and val_list.txt files in the datasets directory. Check the files to see if the script works correctly.

Note: Validate paths and classes in the configs/coco128_filtered.yaml file. It will be used in the next steps.

Training and validation

Note: All of the following scripts use a GPU. If you don’t have one, add the device=cpu parameter to every command. Running neural networks on a CPU is very slow, so be patient or reduce parameters such as imgsz or batch.

  1. Train the YOLO model using the following command. It starts from the pre-trained yolo11n.pt weights, which should improve the training results and speed up the process. You can find all training parameter definitions here.
yolo detect train data=configs/coco128_filtered.yaml model=yolo11n.pt epochs=20 imgsz=512 batch=16

During the training process, you can check the runs/detect/ directory to see the training progress and generated plots.
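If you prefer Python over the CLI, the ultralytics package exposes the same functionality. A minimal sketch equivalent to the command above (assuming ultralytics is installed in the container, which the yolo CLI implies):

from ultralytics import YOLO

model = YOLO("yolo11n.pt")  # start from the pre-trained weights, as in the CLI command
model.train(data="configs/coco128_filtered.yaml", epochs=20, imgsz=512, batch=16)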

  2. When the training process is finished, evaluate the model on the validation dataset. For laboratory purposes, we skip the test subset and evaluate only on the validation subset.
yolo val data=configs/coco128_filtered.yaml imgsz=640 batch=16 conf=0.25 iou=0.6 split=val model=<PATH_TO_BEST_MODEL.PT>

Save the evaluation results to the results.txt file.
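The evaluation can also be scripted, which is convenient for writing the metrics to results.txt. A minimal sketch, assuming the default run directory; the weights path is an example, so point it at your best checkpoint:

from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # example path to the best checkpoint
metrics = model.val(data="configs/coco128_filtered.yaml", imgsz=640, batch=16,
                    conf=0.25, iou=0.6, split="val")
with open("results.txt", "w") as f:
    f.write(f"mAP50: {metrics.box.map50}\nmAP50-95: {metrics.box.map}\n")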

  3. As the last step, you can run the model on your own data. The source parameter accepts an image, a video, or a camera stream: either a file on your disk or a URL to an image or video.
yolo predict data=configs/coco128_filtered.yaml imgsz=640 model=<PATH_TO_BEST_MODEL.PT> source="<URL>"
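In robotics, detections are usually consumed programmatically rather than viewed as rendered images. A minimal sketch of reading out the predicted boxes in Python (the paths are illustrative):

from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # example path
results = model.predict(source="my_image.jpg", imgsz=640)  # file, URL, or camera index
for r in results:
    for box in r.boxes:
        # class id, confidence score, and (x1, y1, x2, y2) pixel coordinates
        print(int(box.cls), float(box.conf), box.xyxy[0].tolist())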

💥 💥 💥 Assignment 💥 💥 💥

To pass the course, you need to upload the following files to the eKursy platform: